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Abstract 

Actors in realistic social networks play not one but a number of diverse roles de- 
pending on whom they interact with, and a large number of such role-specific 
interactions collectively determine social communities and their organizations. 
Methods for analyzing social networks should capture these multi-faceted role- 
specific interactions, and, more interestingly, discover the latent organization or 
hierarchy of social communities. We propose a hierarchical Mixed Membership 
Stochastic Blockmodel to model the generation of hierarchies in social commu- 
nities, selective membership of actors to subsets of these communities, and the 
resultant networks due to within- and cross-community interactions. Furthermore, 
to automatically discover these latent structures from social networks, we develop 
a Gibbs sampling algorithm for our model. We conduct extensive validation of 
our model using synthetic networks, and demonstrate the utility of our model in 
real- world datasets such as predator-prey networks and citation networks. 



1 Introduction 

How do the social communities and their self-organization arise from coordinated interactions and 
information sharing among the actors? One way to tap into this question is to understand the latent 
roles and minds of actors which lead to the formation and organization of these communities. We 
are particularly interested in uncovering the functional/sociological underpinning of network actors, 
and their influence on the network modularity and hierarchy. 

Existing methods for inferring actor roles ifTOl [T6l |9) are limited in that they assume each actor 
undertakes a single, invariant role when interacting with other actors. This is in stark constrast with 
real social networks where actors can play multiple roles (or be under multiple cultural influences); 
the specific role being played depends on whom the actor is interacting with. The Mixed membership 
stochastic blockmodel (MMSB) J2) proposed recently captures the multi-faceted, context-specific 
nature of an actor's role. 

However, the presence of hierarchical structures in real-world social networks raises a signifi- 
cant challenge to current network models. The MMSB formalism described above is intrinsi- 
cally flat in its structure — all roles of an actor are equal citizens in terms of their relation- 
ships to each other — so it does not induce hierarchical structures among actors. A hierar- 
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chical structure goes beyond simple clustering by explicitly including organization at all scales 
in a network simultaneously. Taking a corporate email network as example (Figure [TJ, sup- 
pose a finance department employee in the European branch is communicating with an Ameri- 
can branch executive. The European employee might not communicate with a (possibly business- 
oriented) role related to her membership in the European finance department (which normally 
has no business with the American finance department), but simply with a more generic social 
role as a member of the European branch of the same company (which has occasional acquain- 
tance with the American colleagues). However, when this same employee interacts with a Eu- 
ropean human resource employee, she would likely do so with her role as a European finance 
department member. This notion of interaction-specific, multi-community membership accounts 
for otherwise unusual edges — like the interaction between the European finance employee and 
the American executive — and motivates an intuitive yet powerful approach that allows one to 
infer and visualize the semantic underpinnings of every actor and every link between actors. 

We propose a tree-structured hierarchical 
MMSB (hMMSB) to model community- 
hierarchies and the resultant social network 
as a function of the community member- 
ships undertaken by actors during different 
interactions. Our method treats the com- 
munity structure as a hierarchy of super- 
and sub-communities, while accounting for 
the interaction-specific nature of commu- 
nity membership. Moreover, it automati- 
cally determines the appropriate number of 
sub-communities in each community. This 
sets our method apart from spectral cluster- 
ing methods, which typically do not pro- Figure 1: Illustration of a corporate email network/graph 
duce hierarchies, or agglomerative cluster- as a set of nested communities (Left), and the correspond- 
ing methods, which are inherently restricted ing hierarchy of communities (Right). Vertices in the net- 
to binary hierarchies. We will use our model work represent employees, and directed edges represent 
to analyze a biological food web and a emails sent on a particular day. 
physics paper citation network, and exam- 
ine whether the recovered hierarchies agree 

with the known organizational structure of both networks. Extensive simulation-based validation 
of both our model's consistency and the Gibbs -sampling inference algorithm will be conducted to 
ensure our model's feasibility for hierarchical structure identification. 



1.1 Related Work 

While there has been a great deal of work on graph clustering and community structure inference [7 
14, 18,5,8|, probabilistic generative models for hierarchical community formation and latent multi- 
role inference of every actor and link have just started to draw attention. 

As mentioned earlier, the MMSB model [2] enables inference of the latent roles of every actor and 
link in a network, but it cannot capture hierarchical structures of possible communities in the net- 
work. The link prediction model in [ 17 1 employs an Indian Buffet Process prior over actor positions 
in an infinite-dimensional latent feature space. In that respect, it may be thought of as a nonpara- 
metric extension of the MMSB. However, the goal of their model is missing link prediction rather 
than inference of latent organizational structure. 

Recently, Roy et. al. have developed an "annotated hierarchies" model [19| that generalizes an 
infinite relational model lfT3ll for hierarchical group discovery. Their subsequent work on Mondrian 
Processes |20| provides an alternative nonparametric take on the same issues. Both the annotated hi- 
erarchies model and the Mondrian Process relational model discover binary hierarchies. The infinite 
stochastic blockmodel |12| recovers organizational hierarchies of arbitrary branching factor, among 
other types of structure like grids or partitions, but it does not offer detailed semantic underpinnings, 
such as context-dependent role or community membership instantiations, of every network link. 
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2 Probabilistic Hierarchical Community Model 



Under the MMSB \2\, network links can be instantiated by a role-specific local interaction mech- 
anism: the link between each pair of actors, say is instantiated according to the latent role 
specifically undertaken by actor i when it is to interact with j, and that of j when it is to inter- 
act with i. Crucial to the goal of role-prediction for network data, is the so-called "role vector" 
of mixed membership coefficients, which succinctly captures the probabilities of an actor being 
involved in different roles when he/she interacts with another actor. In this paper, we leverage and 
generalize the "role vector" to model context-dependent mixed memberships in communities of dif- 
ferent scales when links between actors are formed. We propose a hierarchical MMSB (hMMSB) 
model for nested community structures underlying social networks, thereupon such structures as 
well as the community implications of every link can be inferred, given the network. In the sequel, 
we begin with a baseline hMMSB that requires a user to pre-specify the granularity of a tree hierar- 
chy of communities, that is, the branching degree from each group to its subgroup at the next level. 
We will then relax this constraint by employing a nested Chinese Restaurant Process, described in 
0, to allow hierarchies of arbitrary granularities with a fixed depth. 

Let G = {V, E} denote a directed graph, where V — {1, . . . , N} is the set of vertices/actors, and 
E is the set of edges/interactions. We adopt the convention E t j = 1 if (i, j) E E else E^ = 0, and 
we ignore self-edges En. Because the graph is directed, E is not necessarily a symmetric matrix. 
Given an edge Eij, we refer to actor i as the donor and actor j as the receiver. 

Under an hMMSB, we assume that a link follows a generative process based on two latent vari- 
ables unique to each actor. The first one is a community membership-path vector (or in short, path) 
Ci, which records the community memberships of this actor at different levels of the community 
hierarchy in the network. Our model requires an integer parameter K, denoting the maximum depth 
to which the hierarchy should be learnt. Given K, we can represent an actor path Cj as a vector of 
positive integers (cfi, c,2, • ■ • , Clk), where Cik denotes the branch taken by the path at level fc > 1. 
We adopt the convention that the hierarchy root sits at level 0. For clarity of explanation, we will 
assume for now that the actor paths Cj are known. Later, we shall relax this assumption by placing a 
suitable nonparametric prior over c*, allowing them to be posteriorly inferred. 

The second variable is a mixed membership vector (in short, MM) 6i, defining the probabilities of an 
actor identifying himself at different community levels when interacting with other actors. We place 
a two-parameter stick-breaking prior O over 0,, so that actors are not required to participate in the 
same number of community levels. Stick breaking constructions work as follows: Consider a stick 
of length 1. Draw Vn ~ Beta(mir, (1 — m)7r). Let On = Vn and let 1 — 6n be the remainder of the 
stick after chopping off this length Vn. To calculate the length 9 i2 , draw V i2 ~ Beta(mir, (1 — m)7r) 
and chop off this fraction of the remainder of the stick, giving 6^2 = V^(l — Vn). Thus Vik is the 
fraction to chop off from the stick's remainder, and Oik is the length of the kth stick that was chopped 
off. In general, we draw Vik ~ Beta(rmr, (1 — m)tv) from k = 1 to k = 00 and the corresponding 
{dikjfLi i s defined below: 

fc-i 

lk = V lk \{{l-V m ) (1) 

This process is known as the two-parameter GEM distribution O and draws from GEM (to, tt) are 
denoted as 6i ~ GEM(m, tt). to > influences the mean of and tt > influences its variance. 
Because the hierarchy is only learnt up to depth K, we truncate the GEM(to,7t) distribution at 
level K. The stick breaking prior makes it more intuitive to bias interactions toward coarser or finer 
levels compared to a Dirichlet prior with either a single parameter (which is not expressive enough), 
or K — 1 parameters (which may be too expressive). 

To instantiate the edge from actor i to j, we introduce interaction level indicators z^j (donor level) 
and Zi^-j (receiver level), which follow multinomial distributions defined by the MM vectors 6i, 6j 
respectively. The pair (q, z^j) specifies actor i's interaction-specific position in the community 
hierarchy, denoted by Ci[zi^j], which contributes to determining his interaction probability with 
actor j. Note that we are not proposing a full hierarchical model for a tree over conditionally iid 
entities, as in a coalescent process 1111 . Our goal is to model finite-depth nested communities of 
entities connected by a network; therefore our latent hierarchy is not a full binary tree over all nodes, 
and our generative process does not output iid nodal attributes, but links between nodes. 
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Now, given interaction level indicators Zi-^j and z^j, and paths cj and Cj of the two actors in- 
volved, we can determine the community identities (not necessarily at the same level) underlying an 
interaction. These community identities specify a community-pair-specific distribution for the link 

E; 



j irj .~ Bernoulli (S,g(ci, Cj, z.^j, Zi^j)), where 



5 • • • i Ci^ 

otherwise 



e-l) = (C 



' > C j 



(2) 



The notation -B Ci ,cj ,z coarBO G B denotes a parameter specifying the interaction probability from 
community Cj [z coa rse] to Cj [z coi 



We shall explain the intuition behind this in more detail below. 



With all these details, we arrive at the following generative process of a network G = {V, E}: 

• For each actor i £ V: 

- Sample actor i's path from a distribution over hierarchies (detailed in the next section). 

- Sample actor i's MM 8i (distribution over community levels): 9i ~ GEM(m, n). 

• For each element B.,.,. of B in the community-compatability matrix: 

- Sample a value for the element B. .,. ~ Beta(Ai , A2). 

• To generate the network, for each directed edge Eij where i, j 6 V , i 7^ j: 

- Sample the donor level Zj_».j ~ Multinomial(^i). 

- Sample the receiver level Zi<_j ~ Multinomial(Sj). 

- Sample the interaction ~ Bernoulli(S_g(ci, Cj, Zi-^j, Zn-j)). 

The overall intuition behind hMMSB is that Cj and specify a hierarchy position for the donor, 
and likewise for the receiver; and we use these positions to determine the interaction probability. 
There are two different scenarios worth mentioning. First, suppose that the donor and receiver po- 
sitions share the same immediate parent in the hierarchy. Our model associates a "compatibility 
matrix" with every parent position in the hierarchy, which gives the probability of interaction be- 
tween any two of its child positions. The set of all compatibility matrices is denoted by B. Since 
the donor and receiver positions share the same parent, we simply look up the appropriate entry in 
B to get the interaction probability. 

In the second case, the donor and receiver posi- 
tions do not share the same parent; this always 
happens when z^j 7^ z^j (i.e. the positions 
are at different depths). We then coarsen both 
levels to their minimum, z C oarsc- The idea is 
that interactions between two hierarchy posi- 
tions should take place at the level of the higher 
(coarser) position. For example, when an ex- 
ecutive interacts with a sales representative, we 
reduce the interaction to one between an exec- 
utive and a generic group of "other employees" 
below executive rank. (It is in theory possible 
to explicitly model all possible interactions be- 
tween communities of arbitrary levels, but this 
will lead to a severe over-parameterization of 
our model. We have decided to postpone a more 
thorough study of this issue to later work.) If 
the new positions now share the same parent, 
then we look up the appropriate element of B 
as before. Otherwise, we define the interaction 
probability to be zero. 



Beta(^ 2 ) 



noLilli(S fl (c / ,Cy,z i ^-,z, v _)) 




GEM(ra,n) 



Figure 2: Graphical model for hMMSB. 



2.1 Infinite hMMSB 

Thus far, we have assumed knowledge of the paths c^. From a combinatorial standpoint, inferring 
these paths is difficult. Because the number of children at each hierarchy position is unlimited, 
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there are infinitely many possible hierarchies or sets of paths {c^}, even with the fixed depth K. 
One might consider using heuristic methods that guess at the number of such children, but doing so 
would defeat the purpose of employing a probabilistic model. 

Our solution is to place an appropriate nonparametric Bayesian prior over a, the Nested Chinese 
Restaurant Process [3 1, or nCRP for short. The nCRP is an extension of the regular Chinese Restau- 
rant Process (CRP), a recursively-defined prior over positive integers. For concreteness, we shall 
use the first level of each actor path, en, to define the CRP: 

{ \{j<i\c J1 =x}\ ~ (z S r . 1 
x e {ci :(l _i )t ij- (3) 
x is the smallest positive integer not in {ci : (j_i),i} 

where 7 > is a "concentration" parameter that controls the probability of drawing new integers, 
and for conciseness we define Ci : u_u .1 = (en, . . . , cu_x)t). The nCRP is essentially a hierarchy 
of CRP priors, beginning with a single CRP prior at the top level. With each unique integer x seen 
at the top-level prior, we associate a child CRP prior with \{i | Cji = x}\ observations, resulting in 
a two-level tree of CRP priors. We can repeat this process ad infinitum on the newly-created child 
priors, resulting in an infinte-level tree of CRP priors, though we only use a if -level nCRP. All CRP 
priors in the nCRP share the same concentration parameter 7. 

Now we can finish describing our generative process: for each actor i € V, we can sample q ~ 
nCRP( 7 ) using the recursive nCRP definition: 

P(Cife = X I Ci : (i_i), C ia: ( fe _i)) = 

\{j<i\ C 3 -,l:(fc-l)=C i ,i : ( fc _l)Ac,-fe=a;} r , , . .v -. 

|{ J <^ls, l!(fc - 1) =c i , 1:(fc _ 1)} | +7 x e ^ I < A c iM*-D = %Hk-i)} 



\{j<i I c J>I: (fc_i)=c 4 ,i,(h_i)}|+7 



(4) 

x is the smallest positive integer not in the above set. 



Figure |2]displays our complete generative process as a graphical model. As a side note, the infinite 
nCRP prior on paths implies that B contains an infinite number of elements (there could be infinitely 
many children at each parent, so the compatibility matrices must be infinite-dimensional). Our Gibbs 
sampler finesses this issue by integrating out B, while the posterior distribution of each B. . .. can be 
recovered from the path and level samples. 



3 Collapsed Gibbs Sampler 

Exact inference on our model is intractable, so we derive a collapsed Gibbs sampling scheme for 
posterior inference. The O's and B's are integrated out for faster mixing, so we only have to sample 
z and c. The sampling equations are provided below. 

Sampling levels The distribution of z.^j conditioned on all other variables is 



[Z 



l->j 



c,z_(, i ^ ) ,E,7,m,7r, A x , A 2 ) 



oc F(E itj ,Zi^j I c,z_( i _^),E_( i)J -),7,TO,7r,Ai, A 2 ) 

= V(E ld I c, z,E_(jj),7,m,7r, Ai, A 2 )P(^^- | c, z_( f «_,-), E_ (lj) , 7, m, %, Ai, A 2 ) 
= p ( E i,j I c,z,E_ (ij -), Ai,A 2 )P(^ j I Zi^-j),™,-*) (5) 

where E_(, ; j ) is the set of all edges except Eij, and = Z-<-i} \ By Beta- 

Bernoulli conjugacy, the first term, for a particular value of z^j, is 

f E ij (a+\ 1 ) + (l-E i:i )(b+\ 2 ) o ,„ v ,„ 

First term = <^ a+fc+A 1+ A 2 Os^ j^u 

I otherwise 

a = \{(x,y) I (x,y) ? A S B (E xy ) = S B (£«) A = 1}| 

6 = I {(!,») I ^ A S B (E xy ) = S B (£«) A ^ = 0}| (6) 

where we have defined the shorthand S b {Eij ) = S b (<k , Cj , , z.^j ) . 
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The second term can be computed by iterated expectation, conditioning on actor i's stick breaking 
lengths Va,...,V ik : 

F(zi^j = k | 7. it (_j),m,-K) 
= E [l{Zi^j = k) | v it (-j), m, tt] 
= E [E [l{zi^j = k) | Vi\,...,Vi k ,Vi,(-j),m,ir\] 



fc-i 



= E V ik - V iu ) | z ii( _j),TO,7r 

«=i 

fe-i 

= E[V ife | z i)( _j),m,7r] J| E[(l - T^„) | z ij( _ i} , to, tt] 

i 



M=l 



TO7T + #[Z^ ( _ J) = fc] t-J (1 -to)7T + #[z^(_ j - ) > M] 
TT + #K(-j) > ^ „Ji ^ + #K(-j) > «] 

Since we have limited the maximum depth to K, we simply ignore the event z^j > K, and 
renormalize the distribution of z^j over the domain {1, . . . , K}. The conditional distribution of 
z^j is derived in similar fashion. 

Sampling paths The distribution of Cj conditioned on all other variables is 

P(cj | c_i,z,E,7,m,7r,Ai, A 2 ) 
cx P(c i ,E( ii .) ;( . ii ) | c_i,z,E_ (i; . ),_(.,»), 7) m,7r,Ai,A 2 ) 

= p ( E (v),(-,») I c,z,E_ (i) .) _ ( . ii) ,7,m,7r,Ai,A 2 )P(c i | c_ i; z, E_ (i> .) 7, to, tt, Ai, A 2 ) 
= P(E (vU ., 4) I c,z,E_ (ii . )i _ ( . ii) ,Ai,A 2 )P(c i I c_,, 7 ) (8) 

where E^ \ r. ^ = {-E^j, | .t = i V y = i} is the set of all edges Eij whose distributions depend on 
Cj, and E_(j ) .) ) _(. ) j) is its complement. The second term can be computed using the recursive nCRP 
definition: 

P(c ife = x I c 1 .( i „ 1 ),c iil:(fc _ 1) ,7) = 

1 ir-^- — — — ttt — 1 a; e {c jfe {] < 1) A c^i^fc-i) = Ci ; i :(fe _i)} 

|{j<*l C ;) -,i : (fe-l)=C»,l:(Jc_l)}|+7 ' V ' (9) 

tt : - tt — x is the smallest positive integer not in the above set. 

|tK« I Cj,i :(fe _i ) =c i , 1:(fe _ 1) }|+7 

while the first term, for a particular value of Cj, is 

!TT r(g B +fe B +Ai+A 2 ) r(g B +r B +Ai)r(/i£j+s B +A 2 ) c xp q / p \ ^ ri 

11 r(9 B +A 1 )r(ft B +A 2 ) r( SB +/ lB +r B + SB +A 1 +A 2 ) V£ ^ fcrj (v), (■,»)> ^l^i?" 
S£B (ii . ),(.,() 

otherwise 

9b = \{(x,y) I E xy e E_ (i)0 A SuCEsj,) = B A E xy = 1 } | 

/is = I {(a;, 3/) I £ x?/ e E_( i) .) ) _(. |i ) A Sb(E xv ) = B A E xy = 0}| 

r B = |{(a;,y) | S xy e E (4j . )i( .^ A S B (E xy ) = B A E xy = 1 } | 

s B = |{(i,|/)|^eE (ii . )iM A S fl (£ IV )=BA£ IS =0}| (10) 

where B (ij . )i( ., i) ={BeB 3(i,j), (By € E (v) , ( .^ A S^-Ey) = B)} is the set of all BeB 
associated with some edge in E^.)^.^) through Ss(). 

4 Simulation 

We evaluate our model's ability to recover hierarchies on simulated data. For all simulations, the 
number of actors TV was 150, the max depth was K = 2 (2 levels plus root) and 6 = (.25, .75) 
for all actors, meaning that actors interact at level 1 25% of the time and level 2 75% of the time. 
Our experiments explore the effect of different compatibility matrices B. We first explore an on- 
diagonal B, where the diagonal elements are much larger than the off-diagonal elements (actors 
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Figure 3: Simulation Results. Figures 3(a) 3(b) 3(c) and 3(d) show quantitative results. Figures 3(e) 3(f) 



|3(g)|3(h)] illustrate results for one on-diagonal network, and Figures [3(1)1 |3(j)| |3(k)| , |3(1)| illustrate results for 
one off-diagonal network. |3(e)|and|3(i)|are the original networks for these two cases (black indicates edge). 



The numbers inside hierarchy nodes are actor counts (nodes of size < 5 are not shown). See text for details. 



tend to interact within their own communities). We also investigate an off-diagonal B, where the 
off-diagonal elements are larger (actors tend to interact outside their own communities). In the "low 
noise" setting the off-diagonal and on-diagonal elements are far apart (to clearly distinguish them), 
while in the "high noise" setting they are closer together. The exact parameters for the 4 types of 
B's are shown below: 

1. on-diagonal, low noise - B on _ diagonal = (.4, .8), B off _ diagona i = (.02, .02); 

2. on-diagonal, high noise - B on - diagona i = (.3, .6), B off -diagonal = (-1, -1); 

3. off-diagonal, low noise - B on - diagona i = (.02, .02), B of f -diagonal = (.4, .8); 

4. off-diagonal, high noise - B on ^ diagonal = (.1, .1), B off ^ diagonal = (.3, .6). 

Bon-diagonai = (a, b) means that actors interacting in the same level- 1 community do so with 
probability a, while actors interacting in the same level-2 community do so with probability b (and 
analogously for B of / -diagonal)- 

We compare our approach to hierarchical spectral clustering (denoted HSpectral). For spectral clus- 
tering, it is unclear how the number of clusters at each node would be selected, so we give it the 
number of lst-level branches as an advantage (and then let it do a binary split at each level-2 node). 
For hMMSB, we fix m = it = Ai = X2 = -5 and search over 7 = {.01, .1, .5, 1, 1.5, 2}, picking 
the value that maximizes the marginal likelihood. 

The Gibbs sampler was run for 1,500 iterations on each experiment. We calculate the Fl score at 
each level k, Fl k = ^Precision* Recall h Recall = , and Precision = TP r / PP . TP 

' K H.ecall+ Precision TP+FN ' TP+FP 

is true positive count (actors that are in the same cluster and should be up to depth k), FP is false 
positive count, TN is true negative count, and FN is false negative count. The total Fl score is 
computed by averaging the Fife scores for each level. 

Figure [3] illustrates the results as a function of the number of branches at the first level of the gener- 
ated tree. The "number of branches at level 1" refers to number of branches of size > 5 (since there 
are often branches of size 1 or 2). For fairness, the "correct" number of level-1 branches given to 
spectral clustering is also the number of branches of size > 5. 



As one can see, in Figure 3(a) and Figure 3(b)| when B is strongly on-diagonal, our algorithm 



performs well, but a little worse than HSpectral (since HSpectral is given the number of level-1 
branches). A specific example is shown in Figures 3(e)[ |3(f)| |3(g)| [3(h)1 where both models perform 



reasonably well. However, when B is strongly off-diagonal (implying that actors tend to interact 
with those in different communities and rather than within their own community), HSpectral per- 
forms poorly. Our method still gives good results (Figure [3(c)] Figure [3(d)| . An example is shown in 
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Figures 3(i) 3(j) 3(k) 3(1) where our model performs accurately while HSpectral essentially divides 
the actors randomly and performs poorly. Thus HMMSB successfully models a variety of commu- 
nity interactions that traditional clustering methods cannot. It can also recover the actor-specific 
interaction levels for a richer network analysis. 



5 Held-out Evaluation 



MMSB number of roles 
Terrorist dataset: held-out log marginal-likelihoods 



We compare our algorithm to MMSB on two real-world 
datasets, a 75-species food web of grass-feeding wasps SID, and 
the September 11th, 2001 hijacker terrorist network lfl5l l4l. Our 
choices reflect two common modes of interaction seen in real-world 
network data: edges in the food web denote predator-prey relation- 
ships, while edges in the terrorist network reflect social cohesion. 
The food web could be represented as a hierarchy where different 
branches reflect different trophic levels (e.g. parasite, predator or 
prey), while the terrorist network could be interpreted as an organi- 
zation chart. 

For each dataset, we generated 5 sets of training and test subgraphs, 
each obtained by randomly partitioning the actors into two equal 
sets. For each training subgraph, we performed a gridsearch over 
the B parameters (Ai,A2) G {.1, .3, .5, .7, .9} 2 according to the 
log marginal likelihood. The remaining parameters were fixed to 
7 = l.m = it = 0.5 and K = 2. We then computed the log 
marginal likelihood on the corresponding 5 test subgraphs, and av- 
eraged them to obtain hMMSB's average held-out likelihood. The 
procedure for MMSB was similar, except that we used 100 random 
restarts of the MMSB variational EM algorithm [2| on the train- 
ing subgraphs. MMSB also requires the number of latent roles 
R as a tuning parameter, so we repeated the experiment for each 
2 < R < 20. For either algorithm, log marginal likelihoods were 
estimated using 10,000 iterations of importance sampling. 

The results are shown in Figure [4] On either dataset, our model's 

held-out likelihood is superior to MMSB for all R. Notably, MMSB's likelihood peaks on both 
datasets at R = 2, but selecting so few roles will lead to an extremely coarse network analysis. In 
contrast, our model automatically recovers a suitable level of hierarchical complexity and enables 
rich interpretations of the data — as we shall demonstrate next. We note that the "annotated hier- 
archies" model [ 19] and infinite stochastic blockmodel [ 12 1 are good candidates for evaluation, but 
the authors did not make code available for computing marginal likelihoods, so we do not include 
those models in our evaluation. 



MMSB 
— Our model 



MMSB number of roles 



Figure 4: Marginal likelihood 
for MMSB and hMMSB (top: 
grass, bottom: terrorist). Dotted 
lines show hMMSB's error bars. 



6 Real-data Qualitative Analysis 

6.1 Grass-feeding Wasp Parasitoids Food Web 

We now use hMMSB to interpret real-world social networks, beginning with the grass dataset from 
earlier. We ran our Gibbs sampler on the full network to infer the community hierarchy structure and 
actor latent community MMs. The model parameters were chosen on the full network via grid search 
over (Ai, A2) € {.1, .3, .5, .7, .9} 2 , according to the log marginal likelihood (estimated using 10,000 
importance samples). The remaining parameters were fixed to 7 = 1, m = 0.5, tt = 0.5 and K = 2. 
We ran our Gibbs sampler on the optimal parameters Ai = 0.1, A2 = 0.5 for 10,000 iterations of 
burn-in, and took 100 samples with a lag time of 5 iterations. Convergence was determined from a 
plateauing log complete likelihood plot. 

The Gibbs samples represent a posterior distribution over paths c;. In order to represent the "aver- 
age" of this posterior, we generated a consensus sample by counting the number of times each pair of 
actors shared the same community hierarchy position, over all samples. Actors that shared positions 
in > 50% of all samples were assigned to the same path in the consensus. For levels z^j and Zi<-j, 
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we simply took the mode over all samples. In a final post-processing step to reduce visual clutter, 
we merged bottom-level (i.e. level-2) communities with < 5 actors into one community under the 
same parent. 

The inferred community hierarchy and MM vectors from our Gibbs sampler are reported in Figure 
[6] We also show the original network, where links Eij = 1 have been augmented with their com- 
munities and interaction levels (missing links E,^ = are not shown). The dataset contains trophic 
level annotations, shown in the hierarchy as counts and in the network as node shapes. 

We see that community 3 contains all grass species, 2 contains most herbivores, and 1 contains most 
parasitoids; in contrast, an assortative clustering algorithm would not discover these multi-partite 
divisions. The "outlier" communities are still more interesting — the herbivore in community 6 is 
the sole prey of the parasitoids in community 4; hMMSB has separated this sub-web from the main 
food-web. Moreover, the herbivores in community 2.2 are the sole prey of species 42 and 41 in 
community 1.1. We also observe that community 5's two apex parasitoids have the largest and 2nd- 
largest range of prey species, while community 1 .2 contains the apex parasitoid with the third-largest 
range. As to why these communities are separated, we note that the parasitoid in sub-community 1.2 
preys on only two herbivores, compared to at least six herbivores for either parasitoid in community 
5. 

The community MM vectors in Figure [6] show the frequency at which each species identifies itself 
with a particular community. Most species identify at the super-community level, though some 
occasionally identify at the sub-community level. In our model, level-2 interactions occur only 
within super-communities, hence they account for fine-grained, within-community interactions. For 
example, the within-community links in community 4, as well as the links from species 65 in sub- 
community 1.2 to other members of community 1, are all level-2 interactions. Although we have 
not shown interaction levels for missing links, the latter are sometimes accounted for by level-2 
interactions (e.g. in community 1). 



6.2 High-energy Physics Citation Network 



HEP dataset adjacency matrix, sorted in hierarchical order 



Finally, we consider a 1,000-paper subgraph of the arXiv high- 
energy physics citation network HI, which we constructed 
by subsampling papers involved in citations from Jan 2002 
through May 2003. 

We applied the same parameter selection, convergence crite- 
ria, and post-processing as the previous dataset, and the opti- 
mal parameters were (Ai = 0.7, A2 = 0.5). Each of the 25 
parameter combinations required less than 6 hours to test on 
a single processor core. We ran our Gibbs sampler for 10,000 
iterations of burn-in, and took 10 samples with a lag time of 
50 iterations. The entire Gibbs sampling procedure completed 
in just under 23 hours on a single processor core. 

The inferred community hierarchy is shown in Figure|7] where 
each sub-community has been annotated with its papers' most 
frequent title wordsj We also show the adjacency matrix in 
Figure [5] permuted to match the order of inferred commu- 
nities. We observe that the 810-paper sub-community has a 
sparse citation pattern, implying that its papers are not spe- 
cific to a particular research topic. This is confirmed by the top 3 keywords: "theory", "field" and 
"quantum", which are general to physics research. The other sub-communities from the same super- 
community are more focused, with top keywords like "supergravity", "string" and "pp-wave" being 
more specific physical concepts. This is also reflected in the adjacency matrix, which is denser 
among these sub-communities. The remaining super-communities form a very dense sub-network 
that is mostly separated from the rest of the graph, suggesting that these papers might come from a 
specific community of researchers, working on a narrow set of research topics. In particular, three 



Figure 5: HEP network: Adjacency 
matrix, permuted according to the 
communities in Figure|7] 



'While this output is reminiscent of topic models, we stress that hMMSB is not a topic model — the actor 
paths are learnt solely from the citation network, independent of the paper contents. 
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Figure 6: Grass network: Top Left: Inferred hierarchy of communities, with community trophic level counts 
at the bottom. Top Right: Community-mixed-membership vectors of each actor. Bottom: Original network. 
Edges show interacting communities (edge head/tail colors match assumed hierarchy underlying the interac- 
tions) and interaction level (1 = solid, 2 = dashed) inferred by hMMSB. Node shapes represent annotated trophic 
levels (see legend in bottom right). 




Figure 7: HEP network: Inferred hierarchy of communities, with the most frequent title keywords at the 
bottom. Community positions (circles) show the number of papers. 



of the sub-communities involve the title keyword "tachyon", which is absent from the large super- 
community. To this point, we have only scratched the surface of the citation network — a deeper 
analysis would involve the abstracts/contents of the papers in the hMMSB-inferred communities. 

7 Conclusion 

We have developed a tree-structured hierarchical Mixed-Membership Stochastic Blockmodel 
(hMMSB) that models social networks in terms of the multiple, hierarchical community mem- 
berships that actors undertake during interactions. Our model automatically infers the number 
of sub-communities in each community while simultaneously recovering the community-mixed- 
memberships of every actor, setting it apart from hierarchy-discovering methods that are restricted 
to binary hierarchies and/or single-role-memberships for actors. Moreover, hMMSB is expressive 
enough to account for non-diagonal community-compatibility matrices, as we have demonstrated 
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through our simulation and grass dataset experiments. On real networks, we show that hMMSB re- 
covers intuitive, mixed-membership community organization. Finally, our collapsed Gibbs sampler 
efficiently scales to medium-sized datasets of around 1,000 actors, completing inference in a single 
day on a single processor core. 
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